<HTML>
<HEAD>
    <TITLE>Conservative GC Porting Directions</TITLE>
</HEAD>
<BODY>
<H1>Conservative GC Porting Directions</h1>
The collector is designed to be relatively easy to port, but is not
portable code per se.  The collector inherently has to perform operations,
such as scanning the stack(s), that are not possible in portable C code.
<P>
All of the following assumes that the collector is being ported to a
byte-addressable 32- or 64-bit machine.  Currently all successful ports
to 64-bit machines involve LP64 targets.  The code base includes some
provisions for P64 targets (notably win64), but that has not been tested.
You are hereby discouraged from attempting a port to non-byte-addressable,
or 8-bit, or 16-bit machines.
<P>
The difficulty of porting the collector varies greatly depending on the needed
functionality.  In the simplest case, only some small additions are needed
for the <TT>include/private/gcconfig.h</tt> file.  This is described in the
following section.  Later sections discuss some of the optional features,
which typically involve more porting effort.
<P>
Note that the collector makes heavy use of <TT>ifdef</tt>s.  Unlike
some other software projects, we have concluded repeatedly that this is preferable
to system dependent files, with code duplicated between the files.
However, to keep this manageable, we do strongly believe in indenting
<TT>ifdef</tt>s correctly (for historical reasons usually without the leading
sharp sign).  (Separate source files are of course fine if they don't result in
code duplication.)
<H2>Adding Platforms to <TT>gcconfig.h</tt></h2>
If neither thread support, nor tracing of dynamic library data is required,
these are often the only changes you will need to make.
<P>
The <TT>gcconfig.h</tt> file consists of three sections:
<OL>
<LI> A section that defines GC-internal macros
that identify the architecture (e.g. <TT>IA64</tt> or <TT>I386</tt>)
and operating system (e.g. <TT>LINUX</tt> or <TT>MSWIN32</tt>).
This is usually done by testing predefined macros.  By defining
our own macros instead of using the predefined ones directly, we can
impose a bit more consistency, and somewhat isolate ourselves from
compiler differences.
<P>
It is relatively straightforward to add a new entry here.  But please try
to be consistent with the existing code.  In particular, 64-bit variants
of 32-bit architectures generally are <I>not</i> treated as new architectures.
Instead we explicitly test for 64-bit-ness in the few places in which it
matters.  (The notable exception here is <TT>I386</tt> and <TT>X86_64</tt>.
This is partially historical, and partially justified by the fact that there
are arguably more substantial architecture and ABI differences here than
for RISC variants.)
<P>
On GNU-based systems, <TT>cpp -dM empty_source_file.c</tt> seems to generate
a set of predefined macros.  On some other systems, the "verbose"
compiler option may do so, or the manual page may list them.
<LI>
A section that defines a small number of platform-specific macros, which are
then used directly by the collector.  For simple ports, this is where most of
the effort is required.  We describe the macros below.
<P>
This section contains a subsection for each architecture (enclosed in a
suitable <TT>ifdef</tt>).  Each subsection usually contains some
architecture-dependent defines, followed by several sets of OS-dependent
defines, again enclosed in <TT>ifdef</tt>s.
<LI>
A section that fills in defaults for some macros left undefined in the preceding
section, and defines some other macros that rarely need adjustment for
new platforms.  You will typically not have to touch these.
If you are porting to an OS that
was previously completely unsupported, it is likely that you will
need to add another clause to the definition of <TT>GET_MEM</tt>.
</ol>
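<P>
The three-section layout described above can be sketched roughly as follows.
This is only an illustration, not an excerpt from <TT>gcconfig.h</tt>: the
internal names <TT>I386</tt>, <TT>X86_64</tt>, <TT>LINUX</tt> and
<TT>MSWIN32</tt> come from the text, the compiler-predefined macros tested
in the first section are common compiler conventions, and the
<TT>"UNKNOWN"</tt> catch-all and the <TT>sketch_</tt> function are
hypothetical additions so that the fragment compiles on any host.

```c
#include <assert.h>

/* Section 1: map compiler-predefined macros onto internal names. */
#if defined(__x86_64__) || defined(_M_X64)
#   define X86_64
#elif defined(__i386__) || defined(_M_IX86)
#   define I386
#endif
#if defined(__linux__)
#   define LINUX
#elif defined(_WIN32)
#   define MSWIN32
#endif

/* Section 2: one subsection per architecture, OS-dependent parts nested. */
# ifdef X86_64
#   define MACH_TYPE "X86_64"
#   ifdef __ILP32__             /* 32-bit ABI on a 64-bit architecture */
#     define CPP_WORDSZ 32
#     define ALIGNMENT 4
#   else
#     define CPP_WORDSZ 64
#     define ALIGNMENT 8
#   endif
#   ifdef LINUX
#     define OS_TYPE "LINUX"
#   endif
#   ifdef MSWIN32
#     define OS_TYPE "MSWIN32"
#   endif
# endif
# ifdef I386
#   define MACH_TYPE "I386"
#   define CPP_WORDSZ 32
#   define ALIGNMENT 4
#   ifdef LINUX
#     define OS_TYPE "LINUX"
#   endif
#   ifdef MSWIN32
#     define OS_TYPE "MSWIN32"
#   endif
# endif
# ifndef MACH_TYPE
    /* Hypothetical catch-all so the sketch compiles everywhere; */
    /* a real port adds its own architecture subsection instead. */
#   define MACH_TYPE "UNKNOWN"
#   define ALIGNMENT 1          /* always works, but performs poorly */
# endif

/* Section 3: defaults for macros left undefined above. */
# ifndef OS_TYPE
#   define OS_TYPE "UNKNOWN"    /* hypothetical placeholder */
# endif
# ifndef CPP_WORDSZ
#   define CPP_WORDSZ 32
# endif

/* Hypothetical sanity check a port might run once at startup. */
void sketch_check_config(void)
{
    assert(CPP_WORDSZ == 32 || CPP_WORDSZ == 64);
    assert(sizeof(void *) % ALIGNMENT == 0);
}
```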
The following macros must be defined correctly for each architecture and operating
system:
<DL>
<DT><TT>MACH_TYPE</tt>
<DD>
Defined to a string that represents the machine architecture.  Usually
just the macro name used to identify the architecture, but enclosed in quotes.
<DT><TT>OS_TYPE</tt>
<DD>
Defined to a string that represents the operating system name.  Usually
just the macro name used to identify the operating system, but enclosed in quotes.
<DT><TT>CPP_WORDSZ</tt>
<DD>
The word size in bits as a constant suitable for preprocessor tests,
i.e. without casts or sizeof expressions.  Currently always defined as
either 64 or 32.  For platforms supporting both 32- and 64-bit ABIs,
this should be conditionally defined depending on the current ABI.
There is a default of 32.
<DT><TT>ALIGNMENT</tt>
<DD>
Defined to be the largest <TT>N</tt> such that
all pointers are guaranteed to be aligned on <TT>N</tt>-byte boundaries.
Defining it to be 1 will always work, but will perform poorly.
For all modern 32-bit platforms, this is 4.  For all modern 64-bit
platforms, this is 8.  Whether or not X86 qualifies as a modern
architecture here is compiler- and OS-dependent.
<DT><TT>DATASTART</tt>
<DD>
The beginning of the main data segment.  The collector will trace all
memory between <TT>DATASTART</tt> and <TT>DATAEND</tt> for root pointers.
On some platforms, this can be defined to a constant address,
though experience has shown that to be risky.  Ideally the linker will
define a symbol (e.g. <TT>_data</tt>) whose address is the beginning
of the data segment.  Sometimes the value can be computed using
the <TT>GC_SysVGetDataStart</tt> function.  Not used if either
the next macro is defined, or if dynamic loading is supported, and the
dynamic loading support defines a function
<TT>GC_register_main_static_data()</tt> which returns false.
<DT><TT>SEARCH_FOR_DATA_START</tt>
<DD>
If this is defined, <TT>DATASTART</tt> will be defined to a dynamically
computed value which is obtained by starting with the address of
<TT>_end</tt> and walking backwards until non-addressable memory is found.
This often works on Posix-like platforms.  It makes it harder to debug
client programs, since startup involves generating and catching a
segmentation fault, which tends to confuse users.
<DT><TT>DATAEND</tt>
<DD>
Set to the end of the main data segment.  Defaults to <TT>end</tt>,
where that is declared as an array.  This works in some cases, since
the linker introduces a suitable symbol.
<DT><TT>DATASTART2, DATAEND2</tt>
<DD>
Some platforms have two discontiguous main data segments, e.g.
for initialized and uninitialized data.  If so, these two macros
should be defined to the limits of the second main data segment.
<DT><TT>STACK_GROWS_UP</tt>
<DD>
Should be defined if the stack (or thread stacks) grow towards higher
addresses.  (This appears to be true only on PA-RISC.  If your architecture
has more than one stack per thread, and is not already supported, you will
need to do more work.  Grep for "IA64" in the source for an example.)
<DT><TT>STACKBOTTOM</tt>
<DD>
Defined to be the cold end of the stack, which is usually the
highest address in the stack.  It must bound the region of the
stack that contains pointers into the GC heap.  With thread support,
this must be the cold end of the main stack, which typically
cannot be found in the same way as the other thread stacks.
If this is not defined and none of the following three macros
is defined, client code must explicitly set
<TT>GC_stackbottom</tt> to an appropriate value before calling
<TT>GC_INIT()</tt> or any other <TT>GC_</tt> routine.
<DT><TT>LINUX_STACKBOTTOM</tt>
<DD>
May be defined instead of <TT>STACKBOTTOM</tt>.
If defined, then the cold end of the stack will be determined
at run time.  Currently we usually read it from /proc.
<DT><TT>HEURISTIC1</tt>
<DD>
May be defined instead of <TT>STACKBOTTOM</tt>.
<TT>STACK_GRAN</tt> should generally also be redefined (undefined, then defined again).
The cold end of the stack is determined by taking an address inside
<TT>GC_init</tt>'s frame, and rounding it up to
the next multiple of <TT>STACK_GRAN</tt>.  This works well if the stack base is
always aligned to a large power of two.
(<TT>STACK_GRAN</tt> is predefined to 0x1000000, which is
rarely optimal.)
<DT><TT>HEURISTIC2</tt>
<DD>
May be defined instead of <TT>STACKBOTTOM</tt>.
The cold end of the stack is determined by taking an address inside
<TT>GC_init</tt>'s frame, incrementing it repeatedly
in small steps (decrement if <TT>STACK_GROWS_UP</tt>), and reading the value
at each location.  We remember the value when the first
Segmentation violation or Bus error is signalled, round that
to the nearest plausible page boundary, and use that as the
stack base.
<DT><TT>DYNAMIC_LOADING</tt>
<DD>
Should be defined if <TT>dyn_load.c</tt> has been updated for this
platform and tracing of dynamic library roots is supported.
<DT><TT>MPROTECT_VDB, PROC_VDB</tt>
<DD>
May be defined if the corresponding "virtual dirty bit"
implementation in <TT>os_dep.c</tt> is usable on this platform.  This
allows incremental/generational garbage collection.
<TT>MPROTECT_VDB</tt> identifies modified pages by
write protecting the heap and catching faults.
<TT>PROC_VDB</tt> uses the /proc primitives to read dirty bits.
<DT><TT>PREFETCH, PREFETCH_FOR_WRITE</tt>
<DD>
The collector uses <TT>PREFETCH</tt>(<I>x</i>) to preload the cache
with *<I>x</i>.
This defaults to a no-op.
<DT><TT>CLEAR_DOUBLE</tt>
<DD>
If <TT>CLEAR_DOUBLE</tt> is defined, then
<TT>CLEAR_DOUBLE</tt>(x) is used as a fast way to
clear the two words at GC_malloc-aligned address x.  By default,
word stores of 0 are used instead.
<DT><TT>HEAP_START</tt>
<DD>
<TT>HEAP_START</tt> may be defined as the initial address hint for mmap-based
allocation.
<DT><TT>ALIGN_DOUBLE</tt>
<DD>
Should be defined if the architecture requires double-word alignment
of <TT>GC_malloc</tt>ed memory, e.g. 8-byte alignment with a
32-bit ABI.  Most modern machines are likely to require this.
This is no longer needed for GC7 and later.
</dl>
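<P>
As an illustration of the <TT>HEURISTIC1</tt> entry above, the rounding
step might look like the following sketch.  The function name and its
parameters are hypothetical and not part of the collector; the constant is
the predefined <TT>STACK_GRAN</tt> value quoted in the text.  Two details are
assumptions on our part: an address that is already aligned is still moved to
the next boundary (the cold end must lie beyond any address inside the frame),
and for <TT>STACK_GROWS_UP</tt> the address is rounded down instead.

```c
#include <assert.h>
#include <stdint.h>

/* Predefined granularity quoted in the text; rarely optimal. */
#define SKETCH_STACK_GRAN ((uintptr_t)0x1000000)

/* Estimate the cold end of the stack from an address inside a frame. */
/* gran must be a nonzero power of two.                               */
uintptr_t sketch_heuristic1(uintptr_t addr_in_frame, uintptr_t gran,
                            int stack_grows_up)
{
    uintptr_t mask = gran - 1;

    assert(gran != 0 && (gran & mask) == 0);
    if (stack_grows_up)
        return addr_in_frame & ~mask;       /* cold end is the low end  */
    return (addr_in_frame + gran) & ~mask;  /* cold end is the high end */
}
```

<P>
The result is only as good as the assumption that the stack base is aligned
to a multiple of the granularity, which is why the text recommends choosing
<TT>STACK_GRAN</tt> for the platform rather than relying on the predefined value.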
<H2>Additional requirements for a basic port</h2>
In some cases, you may have to add additional platform-specific code
to other files.  A likely candidate is the implementation of
<TT>GC_with_callee_saves_pushed</tt> in <TT>mach_dep.c</tt>.
This ensures that register contents that the collector must trace
from are copied to the stack.  Typically this can be done portably,
but on some platforms it may require assembly code, or just
tweaking of conditional compilation tests.
<P>
For GC7, if your platform supports <TT>getcontext()</tt>, then defining
the macro <TT>UNIX_LIKE</tt> for your OS in <TT>gcconfig.h</tt>
(if it isn't defined there already) is likely to solve the problem.
Otherwise, if you are using gcc, <TT>__builtin_unwind_init()</tt>
will be used, and should work fine.  If that is not applicable either,
the implementation will try to use <TT>setjmp()</tt>.  This will work if your
<TT>setjmp</tt> implementation saves all possibly pointer-valued registers
into the buffer, as opposed to trying to unwind the stack at
<TT>longjmp</tt> time.  The <TT>setjmp_test</tt> test tries to determine this,
but often doesn't get it right.
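<P>
The <TT>setjmp()</tt> fallback described above can be sketched as follows.
This is not the code in <TT>mach_dep.c</tt>; all names starting with
<TT>sketch_</tt> are hypothetical.  It assumes, as the text warns, that
<TT>setjmp</tt> really stores every possibly pointer-valued register in the
buffer, which is not guaranteed on all platforms.

```c
#include <assert.h>
#include <setjmp.h>
#include <string.h>

/* Callback that scans the stack starting near approx_sp. */
typedef void (*sketch_regs_fn)(char *approx_sp, void *arg);

/* Written after the callback so the buffer stays live across the call. */
static void *volatile sketch_sink;

/* Spill register contents into a stack-resident jmp_buf, then run fn. */
void sketch_with_callee_saves_pushed(sketch_regs_fn fn, void *arg)
{
    jmp_buf regs;

    assert(fn != 0);
    memset(&regs, 0, sizeof regs);  /* avoid tracing stale buffer contents */
    (void)setjmp(regs);             /* registers now (hopefully) on stack  */
    fn((char *)&regs, arg);         /* the buffer address approximates sp  */
    sketch_sink = &regs;            /* discourage a tail call              */
}

/* Example callback: just record the approximate stack pointer. */
void sketch_record_sp(char *approx_sp, void *arg)
{
    *(char **)arg = approx_sp;
}
```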
<P>
In GC6.x versions of the collector, tracing of registers
was more commonly handled
with assembly code.  In GC7, this is generally to be avoided.
<P>
Most commonly <TT>os_dep.c</tt> will not require attention, but see below.
<H2>Thread support</h2>
Supporting threads requires that the collector be able to find and suspend
all threads potentially accessing the garbage-collected heap, and locate
any state associated with each thread that must be traced.
<P>
The functionality needed for thread support is generally implemented
in one or more files specific to the particular thread interface.
For example, somewhat portable pthread support is implemented
in <TT>pthread_support.c</tt> and <TT>pthread_stop_world.c</tt>.
The essential functionality consists of
<DL>
<DT><TT>GC_stop_world()</tt>
<DD>
Stops all threads which may access the garbage collected heap, other
than the caller.
<DT><TT>GC_start_world()</tt>
<DD>
Restart other threads.
<DT><TT>GC_push_all_stacks()</tt>
<DD>
Push the contents of all thread stacks (or at least of pointer-containing
regions in the thread stacks) onto the mark stack.
</dl>
These very often require that the garbage collector maintain its
own data structures to track active threads.
<P>
In addition, <TT>LOCK</tt> and <TT>UNLOCK</tt> must be implemented
in <TT>gc_locks.h</tt>.
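<P>
The kind of bookkeeping this implies can be sketched as below.  This is a
deliberately simplified, hypothetical thread table, not the data structure
used in <TT>pthread_support.c</tt>: it uses a plain pthread mutex in the role
of <TT>LOCK</tt>/<TT>UNLOCK</tt>, omits the actual stopping of threads, and
assumes that the stop-world mechanism records each thread's hot stack end.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define SKETCH_MAX_THREADS 64

static pthread_mutex_t sketch_alloc_lock = PTHREAD_MUTEX_INITIALIZER;
#define SKETCH_LOCK()   pthread_mutex_lock(&sketch_alloc_lock)
#define SKETCH_UNLOCK() pthread_mutex_unlock(&sketch_alloc_lock)

struct sketch_thread {
    pthread_t id;
    char *stack_cold;   /* fixed when the thread is registered     */
    char *stack_hot;    /* recorded when the thread is stopped     */
    int in_use;
};
static struct sketch_thread sketch_threads[SKETCH_MAX_THREADS];

/* Add a thread to the table.  Returns 0, or -1 if the table is full. */
int sketch_register_thread(pthread_t id, char *stack_cold)
{
    int i, result = -1;

    SKETCH_LOCK();
    for (i = 0; i < SKETCH_MAX_THREADS; i++) {
        if (!sketch_threads[i].in_use) {
            sketch_threads[i].id = id;
            sketch_threads[i].stack_cold = stack_cold;
            sketch_threads[i].stack_hot = stack_cold;  /* empty so far */
            sketch_threads[i].in_use = 1;
            result = 0;
            break;
        }
    }
    SKETCH_UNLOCK();
    return result;
}

/* Record the hot end of a stopped thread's stack.  Returns 0 or -1. */
int sketch_set_stack_hot(pthread_t id, char *stack_hot)
{
    int i, result = -1;

    SKETCH_LOCK();
    for (i = 0; i < SKETCH_MAX_THREADS; i++) {
        if (sketch_threads[i].in_use
            && pthread_equal(sketch_threads[i].id, id)) {
            sketch_threads[i].stack_hot = stack_hot;
            result = 0;
            break;
        }
    }
    SKETCH_UNLOCK();
    return result;
}

typedef void (*sketch_push_fn)(char *lo, char *hi, void *arg);

/* Hand every nonempty stack range to push.  The caller is assumed to */
/* hold the lock and to have stopped the world.  Returns the count.   */
int sketch_push_all_stacks(sketch_push_fn push, void *arg)
{
    int i, n = 0;

    assert(push != NULL);
    for (i = 0; i < SKETCH_MAX_THREADS; i++) {
        const struct sketch_thread *t = &sketch_threads[i];
        char *lo, *hi;

        if (!t->in_use) continue;
        /* Works for either growth direction: order the two ends. */
        lo = t->stack_hot < t->stack_cold ? t->stack_hot : t->stack_cold;
        hi = t->stack_hot < t->stack_cold ? t->stack_cold : t->stack_hot;
        if (lo < hi) { push(lo, hi, arg); n++; }
    }
    return n;
}

/* Example callback: total the number of bytes that would be scanned. */
void sketch_sum_bytes(char *lo, char *hi, void *arg)
{
    *(size_t *)arg += (size_t)(hi - lo);
}
```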
<P>
The easiest case is probably a new pthreads platform
on which threads can be stopped
with signals.  In this case, the changes involve:
<OL>
<LI>Introducing a suitable <TT>GC_</tt><I>X</i><TT>_THREADS</tt> macro, which should
be automatically defined by <TT>gc_config_macros.h</tt> in the right cases.
It should also result in a definition of <TT>GC_PTHREADS</tt>, as for the
existing cases.
<LI>For GC7+, ensuring that the <TT>atomic_ops</tt> package at least
minimally supports the platform.
If incremental GC is needed, or if pthread locks don't
perform adequately as the allocation lock, you will probably need to
ensure that a sufficient <TT>atomic_ops</tt> port
exists for the platform to provide an atomic test-and-set
operation.  (Current GC7 versions require more <TT>atomic_ops</tt>
support than necessary.  This is a bug.)  For earlier versions define
<TT>GC_test_and_set</tt> in <TT>gc_locks.h</tt>.
<LI>Making any needed adjustments to <TT>pthread_stop_world.c</tt> and
<TT>pthread_support.c</tt>.  Ideally none should be needed.  In fact,
not all of this is as well standardized as one would like, and outright
bugs requiring workarounds are common.
</ol>
Non-preemptive threads packages will probably require further work.  Similarly,
thread-local allocation and parallel marking require further work
in <TT>pthread_support.c</tt>, and may require better <TT>atomic_ops</tt>
support.
<H2>Dynamic library support</h2>
So long as <TT>DATASTART</tt> and <TT>DATAEND</tt> are defined correctly,
the collector will trace memory reachable from file scope or <TT>static</tt>
variables defined as part of the main executable.  This is sufficient
if either the program is statically linked, or if pointers to the
garbage-collected heap are never stored in non-stack variables
defined in dynamic libraries.
<P>
If dynamic library data sections must also be traced, then
<UL>
<LI><TT>DYNAMIC_LOADING</tt> must be defined in the appropriate section
of <TT>gcconfig.h</tt>.
<LI>An appropriate version of the function
<TT>GC_register_dynamic_libraries()</tt> should be defined in
<TT>dyn_load.c</tt>.  This function should invoke
<TT>GC_cond_add_roots(</tt><I>region_start, region_end</i><TT>, TRUE)</tt>
on each dynamic library data section.
</ul>
<P>
Implementations that scan for writable data segments are error prone, particularly
in the presence of threads.  They frequently result in race conditions
when threads exit and stacks disappear.  They may also accidentally trace
large regions of graphics memory, or mapped files.  On at least
one occasion they have been known to try to trace device memory that
could not safely be read in the manner the GC wanted to read it.
<P>
It is usually safer to walk the dynamic linker data structure, especially
if the linker exports an interface to do so.  But beware of poorly documented
locking behavior in this case.
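<P>
As one example of walking the dynamic linker's data through an exported
interface, the sketch below uses <TT>dl_iterate_phdr()</tt>, which is
available on Linux with glibc (an assumption about the target platform).
It is not the code in <TT>dyn_load.c</tt>.  The function
<TT>sketch_add_roots</tt> is a hypothetical stand-in for
<TT>GC_cond_add_roots</tt> that merely counts regions, and, unlike a real
implementation, the sketch also reports the main executable's data segment.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE        /* needed for dl_iterate_phdr in <link.h> */
#endif
#include <link.h>
#include <assert.h>
#include <stddef.h>

static size_t sketch_root_count;

/* Hypothetical stand-in for GC_cond_add_roots: count nonempty regions. */
static void sketch_add_roots(char *start, char *end)
{
    if (start < end) sketch_root_count++;
}

/* Called by the dynamic linker once per loaded object. */
static int sketch_visit_object(struct dl_phdr_info *info, size_t size,
                               void *data)
{
    int i;

    (void)size;
    (void)data;
    assert(info != NULL);
    for (i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *p = &info->dlpi_phdr[i];

        /* Only loadable, writable segments can hold mutable roots. */
        if (p->p_type == PT_LOAD && (p->p_flags & PF_W) != 0) {
            char *start = (char *)(info->dlpi_addr + p->p_vaddr);

            sketch_add_roots(start, start + p->p_memsz);
        }
    }
    return 0;   /* zero means: continue with the next object */
}

/* Register every writable segment; returns how many were found. */
size_t sketch_register_dynamic_libraries(void)
{
    sketch_root_count = 0;
    dl_iterate_phdr(sketch_visit_object, NULL);
    return sketch_root_count;
}
```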
<H2>Incremental GC support</h2>
For incremental and generational collection to work, <TT>os_dep.c</tt>
must contain a suitable "virtual dirty bit" implementation, which
allows the collector to track which heap pages (assumed to be
a multiple of the collector's block size) have been written during
a certain time interval.  The collector provides several
implementations, which might be adapted.  The default
(<TT>DEFAULT_VDB</tt>) is a placeholder which treats all pages
as having been written.  This ensures correctness, but renders
incremental and generational collection essentially useless.
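<P>
The shape of a virtual dirty bit implementation can be sketched as follows.
This is a hypothetical, self-contained model and not the interface in
<TT>os_dep.c</tt>: the page size, table size and all <TT>sketch_</tt> names
are assumptions.  In an <TT>MPROTECT_VDB</tt>-style port,
<TT>sketch_note_write</tt> would be called from the write-fault handler; the
last function mirrors the always-dirty behavior of <TT>DEFAULT_VDB</tt>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u      /* assumed page size            */
#define SKETCH_MAX_PAGES 1024u      /* assumed maximum heap extent  */

static uintptr_t sketch_heap_base;
static unsigned char sketch_live_dirty[SKETCH_MAX_PAGES]; /* being updated */
static unsigned char sketch_snapshot[SKETCH_MAX_PAGES];   /* last read     */

void sketch_dirty_init(void *heap_base)
{
    assert(heap_base != NULL);
    sketch_heap_base = (uintptr_t)heap_base;
    memset(sketch_live_dirty, 0, sizeof sketch_live_dirty);
    memset(sketch_snapshot, 0, sizeof sketch_snapshot);
}

/* Record a write to addr (e.g. from a write-fault handler). */
void sketch_note_write(const void *addr)
{
    size_t page = ((uintptr_t)addr - sketch_heap_base) / SKETCH_PAGE_SIZE;

    if (page < SKETCH_MAX_PAGES) sketch_live_dirty[page] = 1;
}

/* Snapshot the dirty bits and start a new tracking interval. */
void sketch_read_dirty(void)
{
    memcpy(sketch_snapshot, sketch_live_dirty, sizeof sketch_snapshot);
    memset(sketch_live_dirty, 0, sizeof sketch_live_dirty);
}

/* Was the page containing addr written during the last interval? */
/* Pages outside the table are conservatively reported as dirty.  */
int sketch_page_was_dirty(const void *addr)
{
    size_t page = ((uintptr_t)addr - sketch_heap_base) / SKETCH_PAGE_SIZE;

    return page >= SKETCH_MAX_PAGES || sketch_snapshot[page] != 0;
}

/* Placeholder variant: every page counts as written. */
int sketch_default_page_was_dirty(const void *addr)
{
    (void)addr;
    return 1;
}
```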
<H2>Stack traces for debug support</h2>
If stack traces in objects are needed for debug support,
<TT>GC_save_callers</tt> and <TT>GC_print_callers</tt> must be
implemented.
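<P>
On platforms with glibc, one possible starting point is the
<TT>backtrace()</tt> family from <TT>&lt;execinfo.h&gt;</tt>, as in the sketch
below.  The structure and the <TT>sketch_</tt> function signatures are
hypothetical; the text does not specify the signatures the collector expects,
and other platforms will need a different mechanism.

```c
#include <assert.h>
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

#define SKETCH_NFRAMES 8    /* assumed number of frames kept per object */

struct sketch_callinfo {
    void *pc[SKETCH_NFRAMES];   /* return addresses, innermost first */
    int n;                      /* number of valid entries           */
};

/* Capture the current call chain into info. */
void sketch_save_callers(struct sketch_callinfo *info)
{
    assert(info != NULL);
    info->n = backtrace(info->pc, SKETCH_NFRAMES);
}

/* Print a previously saved call chain to stderr. */
void sketch_print_callers(const struct sketch_callinfo *info)
{
    char **names;
    int i;

    assert(info != NULL);
    names = backtrace_symbols(info->pc, info->n);
    for (i = 0; i < info->n; i++)
        fprintf(stderr, "\t%s\n", names != NULL ? names[i] : "(unknown)");
    free(names);    /* the array is malloc'ed; the strings are not */
}
```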
<H2>Disclaimer</h2>
This is an initial pass at porting guidelines.  Some things
have no doubt been overlooked.
</body>
</html>
